Abstract: Clustering, also known as data segmentation, aims to
partition a data set into groups (clusters) according to their
similarity. Cluster analysis has been studied extensively, and
many algorithms exist for different types of clustering. These
classical algorithms cannot be applied directly to big data
because of its distinct features, and applying traditional
techniques to large unstructured data remains a challenge. This
study proposes a hybrid model for clustering big data using the
well-known traditional K-means clustering algorithm. The
proposed model consists of three phases: a Mapper phase, a
Clustering phase, and a Reduce phase. The first phase uses the
MapReduce algorithm to split the big data into small datasets.
The second phase applies the traditional K-means clustering
algorithm to each of the split small datasets. The last phase is
responsible for producing the overall cluster output for the
complete data set. Two functions, Mode and Fuzzy Gaussian, were
implemented and compared in the last phase to determine the
more suitable one. The experimental study used four benchmark
big data sets: Covtype, Covtype-2, Poker, and Poker-2. The
results show the efficiency of the proposed model in clustering
big data using the traditional K-means algorithm. The
experiments also show that the Fuzzy Gaussian function produces
more accurate results than the traditional Mode function.

Keywords: Big Data; MapReduce; Fuzzy Gaussian; K-means.

1. INTRODUCTION 

Data mining is the process of extracting information from data.
It is challenging to obtain relevant information and provide it
within a short time [4]. Supervised learning and unsupervised
learning are the two learning approaches used to mine data [5].
In supervised learning, the data includes both the input and the
desired outcome: the desired results are known and are given to
the model during the learning procedure. Neural networks,
multilayer perceptrons, and decision trees are examples of
supervised models. In unsupervised learning, on the other hand,
the desired outcome is not given to the model during the
learning procedure. This approach can be used to cluster the
input data into classes using their statistical properties
only. Such models include the various types of clustering,
k-means, distance and normalization techniques, and
self-organizing maps.

Data mining includes tasks such as classification, clustering,
regression, and association rule mining. Clustering is the task
of grouping data elements by their similarities and
dissimilarities, which is particularly difficult for big
datasets. A clustering method divides the data into clusters
such that the objects within a cluster have similar properties
to one another and differ from the objects in other clusters.

The rest of this article is organized as follows. Section 2
presents related work on the k-means algorithm. Section 3
briefly discusses the algorithms used in the new model:
MapReduce, k-means, Mode, and Fuzzy Gaussian. The proposed
model is explained in Section 4. Section 5 presents the
experimental results on four big datasets: Covtype, Covtype-2,
Poker Hand, and Poker Hand-2. Section 6 concludes the paper and
discusses the importance of the model.

2. RELATED WORK 

Clustering is a process for partitioning datasets. It groups a
set of objects based on their characteristics, aggregating them
according to their similarities [2] [14]. This technique is
helpful for finding an optimum solution. K-means is the most
famous clustering method. MacQueen first introduced this
algorithm in 1967, though the idea goes back to Hugo Steinhaus
in 1957 [3].

Y. S. Thakare et al. [6] evaluated the performance of the
k-means algorithm on various datasets, namely "Iris," "Wine,"
"Vowel," "Ionosphere," and "Crude oil," using different
distance metrics. They concluded that the performance of k-means
clustering depends on both the dataset and the distance metric
used. The k-means clustering algorithm was evaluated using the
recognition rate for different numbers of clusters. This work
helps in choosing a suitable distance metric for a given
purpose.

SK Ahammad Fahad [7] proposed a method for making the k-means
algorithm more time-efficient with reduced complexity. The
quality of the resulting clusters depends heavily on the
selection of the initial centroids and on the changes in cluster
membership in subsequent iterations. After a certain number of
iterations, only a small part of the data points change their
clusters. The approach first obtains the initial centroids and
then sets intervals that separate the data elements which will
not change their cluster from those which may change their
cluster in subsequent iterations. This significantly decreases
the workload in the case of large datasets.

Two methods for clustering large datasets using MapReduce
were presented in [8]. The first, "K-Means Hadoop MapReduce
(KM-HMR)," focuses on the MapReduce implementation of regular
K-means. The second improves cluster quality by producing
clusters with minimum intra-cluster and maximum inter-cluster
distances for large datasets. The results of the introduced
methodology show an improvement in the execution time of
clustering. Experiments on the original K-means and the proposed
model show that the approach is both effective and efficient.

Mugdha et al. [20] introduced an approximate algorithm based
on k-means. The algorithm reduces the complexity of k-means by
computing over only those attributes that are of interest.
Their algorithm cannot handle categorical data until it is
transformed into equivalent numerical data. The Manhattan
distance is used, which in turn decreases the runtime. The
authors present it as a new method for big data analysis that is
scalable, very fast, and highly accurate. It overcomes the
k-means disadvantage of an uncertain number of full iterations
by setting a fixed number of iterations without losing
precision.

G. Venkatesh et al. [21] presented a method called three-layer
traffic-aware clustering, based on parallel K-means and a
distance metric, for minimizing the network traffic cost. Their
method applies the MapReduce model in three layers. Their paper
discusses and compares several algorithms, e.g., "Bisecting
K-Means", "K-Means Parallel", "Basic K-Means", and "DB-Scan".
The proposed method was run on the same datasets to measure
execution time and accuracy. It enhances performance by
reducing the network traffic using partitioning and
aggregation.

Jerril M. et al. [22] designed a parallel K-means algorithm
based on MapReduce on Hadoop. Their paper compared performance
using three evaluation criteria: speedup, scale-up, and
size-up. Speedup evaluates how well parallelism improves the
execution time. Scale-up checks the ability to grow both the
MapReduce system and the data size, that is, the scalability of
the MapReduce tool. Size-up estimates the ability to handle
data growth by measuring the time taken to execute the parallel
tasks as the data grows. According to the authors, the parallel
implementation of K-means gives better results than the
sequential K-means algorithm.
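
For reference, the three criteria named above are commonly written
as ratios of execution times in the parallel data mining
literature. The notation here is an assumption made for
illustration and is not taken from [22]: T(m, D) denotes the run
time on m nodes for a dataset of size D, and k is the factor by
which the dataset grows.

```latex
\begin{align*}
\text{Speedup}(m) &= \frac{T(1, D)}{T(m, D)}  && \text{(fixed data, more nodes)} \\
\text{Sizeup}(k)  &= \frac{T(m, kD)}{T(m, D)} && \text{(fixed nodes, more data)} \\
\text{Scaleup}(m) &= \frac{T(1, D)}{T(m, mD)} && \text{(nodes and data grow together)}
\end{align*}
```

Under these definitions, an ideal parallel system would show a
speedup close to m and a scale-up close to 1.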

3. MAPREDUCE PARADIGM

MapReduce is a software paradigm for processing massive,
scalable datasets on a cluster of machines. The MapReduce model
can process unstructured datasets stored across the cluster. It
is one of the most popular models for processing large datasets
with parallel, distributed algorithms. It offers a number of
benefits for handling large datasets, such as scalability,
flexibility, and fault tolerance. The MapReduce framework is
widely used for processing and managing large datasets, in
applications such as document clustering, access log analysis,
and generating search indexes. MapReduce is a processing
technique and a programming model for distributed computing; its
widely used Hadoop implementation is written in Java. The
MapReduce algorithm contains two essential tasks, namely Map and
Reduce.

A. Mapper Phase 

The MapReduce framework is commonly used to analyze enormous
datasets such as sets of tweets, online texts, or large-scale
graphs. Mapper and Reducer are the two essential phases of the
MapReduce algorithm. The mapper phase starts the execution of
the MapReduce program. The large dataset is passed to a mapping
function that creates similar small datasets, called chunks
[12]. The mapper takes a list of key/value pairs, processes each
of them, and produces zero or more key/value pairs. The output
of the mapper phase contains the key and, as the value, the
number of instances that lie in the dataset. This structure
gives programmers a simple and stable interface for solving
large-scale clustering problems.

Algorithm 1: Mapper Function

Store samples dataset
Do
    Read mapper-data from samples dataset one by one
    Do
        clustered data = k-means(mapper-data, K, distance);
        Send clustered data to the Reducer
    End Of Mapper-Data
Until end-of-file
Call reducer
End
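
The mapper loop of Algorithm 1 can be sketched in Python as
follows. This is a minimal illustration under stated assumptions,
not the authors' MATLAB implementation: the names
split_into_chunks, mapper, and threshold_cluster are hypothetical,
and a trivial threshold rule stands in for k-means so that the
sketch runs on its own.

```python
def split_into_chunks(samples, chunk_size):
    """Yield consecutive chunks (the 'mapper-data') of at most chunk_size samples."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def mapper(samples, chunk_size, cluster_fn):
    """Apply cluster_fn to every chunk and emit (chunk_id, labels) pairs."""
    emitted = []
    for chunk_id, chunk in enumerate(split_into_chunks(samples, chunk_size)):
        labels = cluster_fn(chunk)          # e.g. k-means on this chunk only
        emitted.append((chunk_id, labels))  # "send clustered data to the reducer"
    return emitted


def threshold_cluster(chunk):
    """Hypothetical stand-in for k-means: label 0 below 5, label 1 otherwise."""
    return [0 if x < 5 else 1 for x in chunk]


pairs = mapper(list(range(10)), chunk_size=4, cluster_fn=threshold_cluster)
print(pairs)
```

In a real run, chunk_size would be chosen so that one chunk fits
in memory, and cluster_fn would be the k-means routine.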

B. k-means Clustering Phase 

Clustering is considered a core task of exploratory data
analysis and of data mining applications. The clustering task
groups a set of objects in such a way that objects in the same
group (a cluster) are more similar to one another than to those
in other groups (clusters) [9] [11]. Partition clustering is a
widely used method in which a number of objects is given and the
dataset is partitioned into a number of clusters, each of which
contains similar objects.

The k-means algorithm is used extensively for clustering large
datasets. The idea is to classify a given set of data into k
disjoint clusters, where the value of k is fixed in advance. The
k-means algorithm [6] is effective at producing clusters for
many practical applications. It is widely used because it can
produce better cluster results than other clustering techniques
and is fast to compute. However, the computational complexity of
the traditional k-means algorithm is extremely high,
particularly for large sets of data. Moreover, the algorithm
produces different clusters depending on the random choice of
initial centroids. Researchers have made many attempts to
improve the performance of the k-means clustering algorithm.
This paper proposes a method for improving the accuracy and
efficiency of the k-means algorithm.

Algorithm 2: K-means

Given: a dataset of n elements $(e_1, e_2, \ldots, e_n)$; K: number
of clusters.

Target: split the n data elements into k (< n) partitions
$P = \{P_1, P_2, \ldots, P_k\}$.

1. Set the initial mean value of each of the k clusters randomly.

2. Assign each data element to the closest mean:

   $P_i^{(t)} = \{\, e_p : \| e_p - m_i^{(t)} \| \le \| e_p - m_j^{(t)} \| \ \ \forall j,\ 1 \le j \le k \,\}$

3. When all data elements have been assigned, the centroid of
   each of the k clusters becomes the new mean:

   $m_i^{(t+1)} = \frac{1}{|P_i^{(t)}|} \sum_{e_j \in P_i^{(t)}} e_j$

4. Repeat Steps 2 and 3 until the assignments no longer change.
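
The four steps of Algorithm 2 can be sketched in plain Python as
below. This is an illustrative sketch rather than the MATLAB code
used in the experiments; the function names, the max_iter guard,
the fixed seed, and the toy points are assumptions added so the
example is reproducible.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Iterate assignment and update steps; return (means, labels)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)          # step 1: random initial means
    labels = None
    for _ in range(max_iter):
        # step 2: assign each element to its closest mean
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_labels == labels:           # step 4: assignments are stable
            break
        labels = new_labels
        # step 3: each mean becomes the centroid of its partition
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:                    # keep the old mean if a partition is empty
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
means, labels = kmeans(points, k=2)
print(labels)
```

The first three points end up in one cluster and the last three
in the other; which cluster receives index 0 depends on the
random initialization.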


C. Reducer Phase 

The reducer phase is the second main part of MapReduce. It is
responsible for collecting the results coming from the mappers.
The reducer has three steps: Shuffle, Sort, and Reduce.

The shuffle step receives the output of the mapper phase as
input and merges these result tuples into a smaller collection
of tuples. In the sort step, values are sorted according to
their key. The shuffle and sort steps run in parallel. The last
step calls the reduce method, which takes a <key, list of
corresponding values> pair and writes the output to the file
system.

The reduce phase produces a single output, although multiple
reducers can be used to parallelize the aggregation. Overall,
MapReduce makes it easier to scale data processing over many
computing nodes.
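
The shuffle, sort, and reduce steps described above can be
illustrated with a small in-memory sketch. It is not Hadoop code:
the function names and the sample (cluster label, point count)
pairs are hypothetical, and a real framework performs these steps
across machines rather than in one process.

```python
from collections import defaultdict


def shuffle(mapped_pairs):
    """Group the values emitted by all mappers under their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def run_reduce(mapped_pairs, reduce_fn):
    """Shuffle, sort by key, then call reduce_fn(key, values) once per key."""
    groups = shuffle(mapped_pairs)
    return [(key, reduce_fn(key, groups[key])) for key in sorted(groups)]


# Hypothetical mapper output: (cluster label, number of points) per chunk
mapped = [(1, 30), (0, 12), (1, 5), (0, 8), (2, 7)]
totals = run_reduce(mapped, lambda key, values: sum(values))
print(totals)
```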

Algorithm 3: Reducer Function

1. Store clustered data
2. Generate cluster label vector for clustered data
3. Generate output matrix M x D, where M is the number of
   mappers and D is the number of clusters
4. Initialize the output label matrix for all clusters
5. Get output from all mappers
   While hasnext(intermediateValuesIn)
       Put outcome from mappers into output matrix:
           Matrix(i) = Output
       Allocate cluster label according to cluster vote function:
           Cluster label = Cluster vote(output)
   End While


1) Fuzzy Gaussian membership function 

The Gaussian fuzzy membership function is well known in the
fuzzy logic literature. It is considered the main connection
between fuzzy systems and radial basis function (RBF) neural
networks. The Gaussian is also used to represent vague,
linguistic terms. It relies on an adaptive distance measure: it
can adapt the distance norm to the underlying distribution of
the data, which is reflected in the different sizes of the
clusters [1]. Gaussian functions are used in statistics to
describe normal distributions, in signal processing to define
Gaussian filters, in image processing, where two-dimensional
Gaussians are used for Gaussian blurs, and in mathematics to
solve heat and diffusion equations and to define the Weierstrass
transform.

Algorithm 4: Fuzzy Gaussian Membership

1. Get the cluster label matrix from all mappers
2. Generate an M x N matrix containing the cluster labels, where
   M is the number of clusters and N is the number of mappers
3. Do
   1. Compute the mean and standard deviation of every
      clustered data set:

      $\mu_i = \frac{1}{N} \sum_{t=1}^{N} x_t, \qquad \sigma_i^2 = \frac{\sum_{t=1}^{N} (x_t - \mu_i)^2}{N}$

   2. Compute the membership function for every cluster:

      $p_i(x) = e^{-\frac{1}{2} \left( \frac{x - \mu_i}{\sigma_i} \right)^2}, \quad i = 1, 2, \ldots, M$

      where M is the number of clusters

   3. Allocate the cluster label with the greatest membership
      function:

      cluster = $\arg\max_i \, p_i(x)$
   Until End Of File
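
Step 3 of Algorithm 4 can be sketched as follows for
one-dimensional values. This is a hedged illustration only: the
function names and the toy clusters are hypothetical, the factor
1/2 in the exponent is the usual Gaussian membership form and is
assumed here, and every cluster is assumed to have a nonzero
standard deviation.

```python
import math


def mean_and_std(values):
    """Population mean and standard deviation (dividing by N)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return mu, sigma


def gaussian_membership(x, mu, sigma):
    """Gaussian membership exp(-0.5 * ((x - mu) / sigma) ** 2); assumes sigma > 0."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def assign_cluster(x, clusters):
    """Return the index of the cluster with the greatest membership for x."""
    params = [mean_and_std(values) for values in clusters]
    memberships = [gaussian_membership(x, mu, sigma) for mu, sigma in params]
    return max(range(len(clusters)), key=lambda i: memberships[i])


# Hypothetical one-dimensional clustered data
clusters = [[1.0, 2.0, 3.0], [8.0, 10.0, 12.0]]
label = assign_cluster(4.0, clusters)
print(label)
```

Because the membership value depends on each cluster's own spread,
a wide cluster can claim a point that lies closer to the mean of a
narrow one, which a plain nearest-mean rule would not do.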

2) Mode function 

The mode function is a majority vote. The concept of mode makes
sense for any random variable that takes values from a vector
space, including the real numbers and the integers. The mode
function is easy to understand and simple to calculate. The
cluster label is allocated according to the majority of the
clustered data.


Cluster label = mode(output)
mode(output) = argmax over labels c of count(output = c)
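
A majority vote of this kind can be sketched with the Python
standard library. The layout of the output matrix (one row per
item to be labeled, one column per vote) is an assumption for
illustration, and mode_vote is a hypothetical helper name.

```python
from collections import Counter


def mode_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical output matrix: each row holds the votes for one item
output = [
    [0, 0, 1],
    [2, 1, 1],
    [2, 2, 2],
]
cluster_labels = [mode_vote(row) for row in output]
print(cluster_labels)
```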

4. PROPOSED MODEL

The K-means algorithm starts from a predetermined number of
clusters and iteratively reallocates objects among groups until
convergence. The proposed model is based on k-means and is
executed under the MapReduce programming model.

The proposed model consists of two phases, as shown in Fig. 1:
the Mapper phase and the Reduce phase. The first phase splits
the big dataset into small groups, called mappers, sized
according to the RAM capacity. The main processing then starts:
K-means receives the data from each mapper and returns the
cluster labels.

The second phase is the reducer phase, which uses the Fuzzy
Gaussian function or the Mode function. It starts after
receiving the cluster labels and combines them to produce one
output.

Fig. 1: Flowchart of the proposed model.


5. EXPERIMENTAL RESULTS

Data mining algorithms have two important performance
indicators: the accuracy of the clustered data and the time
taken for training.

A MapReduce tool was developed to implement the proposed
approach. The experiments were performed on a machine with an
Intel Core i7 processor, 16 GB of RAM, and the Windows 10
operating system. MATLAB R2014b was used in the experiments.

A. Dataset 

The datasets used in this paper were taken from [10]. They are
four openly available datasets; Table I shows their main
features [10].


TABLE I. Dataset Details

Dataset name      Records   Attributes   Classes
"Covtype"          581012           54         7
"Covtype-2"        581012           54         2
"Poker Hand"      1025009           10         9
"Poker Hand-2"    1025009           10         2


The Covtype dataset contains 581012 samples for predicting
forest cover type from cartographic variables. Each sample
belongs to one of seven categories (classes), such as
Spruce/Fir, Lodgepole Pine, Ponderosa Pine, and
Cottonwood/Willow. The second dataset, Covtype-2, is the same as
Covtype except that it has two classes.


Each instance of the Poker Hand dataset represents a hand of
five playing cards drawn from a standard deck of 52. Each card
is described by two attributes, suit and rank, for a total of 10
predictive attributes. The order of the cards matters, which is
why there are 480 possible Royal Flush hands rather than 4.
Likewise, the "Poker Hand-2" dataset is the same as Poker Hand
except that it has two classes.
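
The figure of 480 Royal Flush hands quoted above follows from
counting orderings: there is one Royal Flush per suit, and its
five cards can be arranged in 5! ways. A short check, with
variable names chosen here for illustration:

```python
from math import factorial

SUITS = 4           # one Royal Flush (10, J, Q, K, A) exists per suit
CARDS_IN_HAND = 5   # the five cards can appear in any order

unordered_royal_flushes = SUITS
ordered_royal_flushes = SUITS * factorial(CARDS_IN_HAND)
print(unordered_royal_flushes, ordered_royal_flushes)
```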

B. Results 

This part presents the results obtained after running K-means
in the mapper phase and using the two different functions in the
reducer phase. The four datasets were used in the experiments.


TABLE II. Accuracy and Time Taken by Training the K-means and
Mode Algorithm

Method               Covtype   Covtype-2   Poker      Poker-2
Accuracy (%)         56.72     62.1        62.67      63.2
Time Taken (Ratio)   7.20126   6.725463    6.012545   6.21541


The results obtained using the Mode function are shown in
Table II. On the "Covtype" dataset, the proposed approach
achieves 56.72% accuracy with a time of 7.20126, which is the
lowest accuracy and the highest time taken among the four
datasets. The accuracy improves by 5.38 percentage points (to
62.1%) when the number of classes is decreased using
"Covtype-2".


TABLE III. Accuracy and Time Taken by Training the K-means and
Fuzzy Gaussian Algorithm

Method               Covtype   Covtype-2   Poker      Poker-2
Accuracy (%)         62.1      75.6        63.4       73.4
Time Taken (Ratio)   8.01245   8.12542     7.124512   7.124512


The results obtained using the Fuzzy Gaussian function are
shown in Table III. The best accuracy is 75.6%, achieved on
"Covtype-2", which improves on the 62.1% achieved on
"Covtype".

Figure 2 compares the accuracies of the Mode and Fuzzy Gaussian
functions used in the reducer phase. The results show an
improvement when using the Fuzzy Gaussian function rather than
the Mode function on all four datasets ("Covtype", "Covtype-2",
"Poker", and "Poker-2"), while still leading to simple and
straightforward linear algebra implementations. The Fuzzy
Gaussian function is probably more accurate because it allows
one to quantify the uncertainty in predictions that results not
only from intrinsic noise in the problem but also from errors in
the parameter estimation procedure.

Fig. 2: Accuracy comparison for the four datasets.

Fig. 3: Run time comparison for the four datasets.


Figure 3 compares the time taken by the Mode and Fuzzy Gaussian
functions. The time taken by the Fuzzy Gaussian function is
higher than that of the Mode function, probably because its
calculations (means, standard deviations, and membership
values) are more expensive than a simple majority vote.
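
The per-dataset differences behind this comparison can be computed
directly from the values reported in Tables II and III. The short
script below only restates those reported numbers; the variable
names are chosen here for illustration.

```python
datasets = ["Covtype", "Covtype-2", "Poker", "Poker-2"]

# Values copied from Table II (Mode) and Table III (Fuzzy Gaussian)
mode_accuracy = [56.72, 62.1, 62.67, 63.2]
fuzzy_accuracy = [62.1, 75.6, 63.4, 73.4]
mode_time = [7.20126, 6.725463, 6.012545, 6.21541]
fuzzy_time = [8.01245, 8.12542, 7.124512, 7.124512]

# Accuracy gain in percentage points and extra time, per dataset
accuracy_gain = [round(f - m, 2) for f, m in zip(fuzzy_accuracy, mode_accuracy)]
time_overhead = [round(f - m, 6) for f, m in zip(fuzzy_time, mode_time)]

for name, gain, extra in zip(datasets, accuracy_gain, time_overhead):
    print(f"{name}: +{gain} accuracy points, +{extra} time")
```

The gain is largest on the two-class datasets (13.5 points on
Covtype-2 and 10.2 on Poker-2) and smallest on Poker (0.73
points), while the time overhead is positive on every dataset.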

6. CONCLUSION 

The proposed approach is based on the MapReduce programming
model. It consists of two phases, the Mapper and Reduce phases.
In the Mapper phase, the data is distributed to a group of
mappers using the map function, and K-means is applied to the
small datasets held by the mappers. In the Reduce phase, the
reduce function combines the mapper outputs using the "Mode" and
"Fuzzy Gaussian" functions. The Gaussian function supports
mixed membership, and each cluster can have an unconstrained
covariance structure (think of a rotated or elongated
distribution of points in a group). The cluster assignment is
flexible: every instance belongs to each cluster to a different
degree, according to the probability that the instance was
generated from that cluster's (multivariate) normal
distribution. Experimental results showed that the proposed
approach gives higher accuracy with the "Fuzzy Gaussian"
function than with the "Mode" function, at the cost of a longer
execution time.